Abstract

Processors in large-scale multiprocessors must be able
to tolerate large communication latencies and
synchronization delays. This paper describes the architecture
of a rapid-context-switching processor called APRIL
with support for fine-grain threads and synchronization.
APRIL achieves high single-thread performance
and supports virtual dynamic threads. A commercial
RISC-based implementation of APRIL and a run-time
software system that can switch contexts in about 10
cycles are described. Measurements taken for several
parallel applications on an APRIL simulator show that the
overhead for supporting parallel tasks based on futures
is reduced by a factor of two over a corresponding
implementation on the Encore Multimax. The scalability
of a multiprocessor based on APRIL is explored using
a performance model. We show that the SPARC-based
implementation of APRIL can achieve close to 80%
processor utilization with as few as three resident threads
per processor in a large-scale cache-based machine with
an average base network latency of 55 cycles.

1  Introduction 

The  requirements  placed  on  a  processor  in  a  large-scale 
multiprocessing  environment  are  different  from  those  in 
a  uniprocessing  setting.  A  processor  in  a  parallel  ma¬ 
chine  must  be  able  to  tolerate  high  memory  latencies 
and  handle  process  synchronization  efficiently  [2].  This 
need increases as more processors are added to the
system.

Parallel applications impose processing and
communication bandwidth demands on the parallel machine.
An  efficient  and  cost-effective  machine  design  achieves  a 
balance  between  the  processing  power  and  the  commu¬ 
nication  bandwidth  provided.  An  imbalance  is  created 
when  an  underutilized  processor  cannot  fully  exploit  the 
available  network  bandwidth.  When  the  network  has 
bandwidth  to  spare,  low  processor  utilization  can  re¬ 
sult  from  high  network  latency.  An  efficient  processor 
design  for  multiprocessors  provides  a  means  for  hiding 
latency.  When  sufficient  parallelism  exists,  a  processor 
that  rapidly  switches  to  an  alternate  thread  of  computa¬ 
tion  during  a  remote  memory  request  can  achieve  high 
utilization. 

Processor utilization also diminishes due to
synchronization latency. Spin lock accesses have a low
overhead of memory requests, but busy-waiting on a
synchronization event wastes processor cycles.
Synchronization mechanisms that avoid busy-waiting through
process blocking incur a high overhead.

Full/empty  bit  synchronization  [22]  in  a  rapid  context 
switching  processor  allows  efficient  fine-grain  synchro¬ 
nization.  This  scheme  associates  synchronization  infor¬ 
mation  with  objects  at  the  granularity  of  a  data  word, 
allowing  a  low-overhead  expression  of  maximum  con¬ 
currency.  Because  the  processor  can  rapidly  switch  to 
other  threads,  wasteful  iterations  in  spin-wait  loops  are 
interleaved  with  useful  work  from  other  threads.  This 
reduces  the  negative  effects  of  synchronization  on  pro¬ 
cessor  utilization. 

This paper describes the architecture of APRIL,
a processor designed for large-scale multiprocessing.
APRIL builds on previous research on processors for
parallel architectures such as HEP [22], MASA [8],
P-RISC [19], [14], [15], and [18]. Most of these processors
support fine-grain interleaving of instruction streams
from multiple threads, but suffer from poor
single-thread performance. In the HEP, for example,
instructions from a single thread can only be executed once
every 8 cycles. Single-thread performance is important
for efficiently running sections of applications with low
parallelism.

APRIL does not support cycle-by-cycle interleaving
of threads. To optimize single-thread performance,
APRIL executes instructions from a given thread until
it performs a remote memory request or fails in a
synchronization attempt. We show that such coarse-grain
multithreading allows a simple processor design with
context switch overheads of 4-10 cycles, without
significantly hurting overall system performance (although
the pipeline design is complicated by the need to handle
pipeline dependencies). In APRIL, thread scheduling is
done in software, and unlimited virtual dynamic threads
are supported. APRIL supports full/empty bit
synchronization, and provides tag support for futures [9]. In this
paper the terms process, thread, context, and task are
used equivalently.

By taking a systems-level design approach that
considers not only the processor, but also the compiler and
run-time system, we were able to migrate several
non-critical operations into the software system, greatly
simplifying processor design. APRIL's simplicity allows an
implementation based on minor modifications to an
existing RISC processor design. We describe such an
implementation based on Sun Microsystems' SPARC
processor [2.1]. A compiler for APRIL, a run-time system,
and an APRIL simulator are operational. We present
simulation results for several parallel applications on
APRIL's efficiency in handling fine-grain threads and
assess the scalability of multiprocessors based on a
coarse-grain multithreaded processor using an
analytical model. Our SPARC-based processor supports four
hardware contexts and can switch contexts in about 10
cycles, which yields roughly 80% processor utilization
in a system with an average base network latency of 55
cycles.

The rest of this paper is organized as follows.
Section 2 is an overview of our multiprocessor system
architecture and the programming model. The architecture
of APRIL is discussed in Section 3, and its instruction
set is described in Section 4. A SPARC-based
implementation of APRIL is detailed in Section 5. Section 6
discusses the implementation and performance of the
APRIL run-time system. Performance measurements of
APRIL based on simulations are presented in Section 7.
We evaluate the scalability of multithreaded processors
in Section 8.

2  The  ALEWIFE  System 

APRIL is the processing element of ALEWIFE, a
large-scale multiprocessor being designed at MIT. ALEWIFE
is a cache-coherent machine with distributed,
globally-shared memory. Cache coherence is maintained using
a directory-based protocol [5] over a low-dimension
direct network [20]. The directory is distributed with the
processing nodes.

Figure 1: ALEWIFE node

2.1  Hardware 

As shown in Figure 1, each ALEWIFE node consists of
a processing element, floating-point unit, cache, main
memory, cache/directory controller and a network
routing switch. Multiple nodes are connected via a direct,
packet-switched network.

The controller synthesizes a global shared memory
space via messages to other nodes, and satisfies requests
from other nodes directed to its local memory. It
maintains strong cache coherence [7] for memory accesses.
On exception conditions, such as cache misses and failed
synchronization attempts, the controller can choose to
trap the processor or to make the processor wait. A
multithreaded processor reduces the ill effects of the
long-latency acknowledgment messages resulting from
a strong cache coherence protocol. To allow
experimentation with other programming models, the controller
provides special mechanisms for bypassing the
coherence protocol and facilities for preemptive
interprocessor interrupts and block transfers.

The ALEWIFE system uses a low-dimension direct
network. Such networks scale easily and maintain high
nearest-neighbor bandwidth. However, the longer
expected latencies of low-dimension direct networks
compared to indirect multistage networks increase the need
for processors that can tolerate long latencies.
Furthermore, the lower bandwidth of direct networks over
indirect networks with the same channel width introduces
interesting design tradeoffs.

In the ALEWIFE system, a context switch occurs
whenever the network must be used to satisfy a
request, or on a failed synchronization attempt. Since
caches reduce the network request rate, we can
employ coarse-grain multithreading (context switch
every 50-100 cycles) instead of fine-grain multithreading
(context switch every cycle). This simplifies
processor design considerably because context switches can be
more expensive (4 to 10 cycles), and functionality such
as scheduling can be migrated into run-time software.
Single-thread performance is optimized, and techniques
used in RISC processors for enhancing pipeline
performance can be applied [10]. Custom design of a
processing element is not required in the ALEWIFE system;
indeed, we are using a modified version of a commercial
RISC processor for our first-round implementation.

2.2  Programming  Model 

Our experimental programming language for ALEWIFE
is Mul-T [16], an extended version of Scheme. Mul-T's
basic mechanism for generating concurrent tasks is the
future construct. The expression (future X), where
X is an arbitrary expression, creates a task to evaluate
X and also creates an object known as a future to
eventually hold the value of X. When created, the future
is in an unresolved, or undetermined, state. When the
value of X becomes known, the future resolves to that
value, effectively mutating into the value of X.
Concurrency arises because the expression (future X)
returns the future as its value without waiting for the
future to resolve. Thus, the computation containing
(future X) can proceed concurrently with the
evaluation of X. All tasks execute in a shared address-space.

The  result  of  supplying  a  future  as  an  operand  of 
some  operation  depends  on  the  nature  of  the  operation. 
Non-strict operations, such as passing a parameter to
a  procedure,  returning  a  result  from  a  procedure,  as¬ 
signing  a  value  to  a  variable,  and  storing  a  value  into  a 
field  of  a  data  structure,  can  treat  a  future  just  like  any 
other  kind  of  value.  Strict  operations  such  as  addition 
and  comparison,  if  applied  to  an  unresolved  future,  are 
suspended  until  the  future  resolves  and  then  proceed, 
using  the  value  to  which  the  future  resolved  as  though 
that  had  been  the  original  operand. 

The act of suspending if an object is an unresolved
future and then proceeding when the future resolves is
known as touching the object. The touches that
automatically occur when strict operations are attempted
are referred to as implicit touches. Mul-T also includes
an explicit touching or “strict” primitive (touch X)
that touches the value of the expression X and then
returns that value.
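
The future and touch semantics described above can be illustrated with a minimal Python sketch. It is an analogy only, not Mul-T and not the APRIL run-time system: the class Future and the helpers future, touch, and strict_add are hypothetical names introduced here, and a Python thread stands in for a Mul-T task.

```python
import threading

class Future:
    """Placeholder for a value still being computed (hypothetical class)."""
    def __init__(self, thunk):
        self._done = threading.Event()
        self._value = None
        # Evaluate the expression in a separate task, as (future X) would.
        threading.Thread(target=self._run, args=(thunk,), daemon=True).start()

    def _run(self, thunk):
        self._value = thunk()
        self._done.set()              # the future resolves

def future(thunk):
    """Return immediately with a future; do not wait for the value."""
    return Future(thunk)

def touch(obj):
    """Suspend while obj is an unresolved future, then return its value."""
    while isinstance(obj, Future):
        obj._done.wait()
        obj = obj._value
    return obj

def strict_add(a, b):
    # A strict operation touches its operands implicitly.
    return touch(a) + touch(b)

f = future(lambda: 6 * 7)
pair = (f, "tag")                     # non-strict use: stored without waiting
result = strict_add(pair[0], 1)       # strict use: waits for the value first
```

Storing the future in a tuple does not wait for it, while strict_add does, which mirrors the distinction between non-strict and strict operations.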

Futures express control-level parallelism. In a large
class of algorithms, data parallelism is more
appropriate. Barriers are a useful means of synchronization for
such applications on MIMD machines, but force
unnecessary serialization. The same serialization occurs in
SIMD machines. Implementing data-level parallelism
in a MIMD machine that allows the expression of
maximum concurrency requires cheap fine-grain
synchronization associated with each data object. We provide this
support in hardware with full/empty bits.

We are augmenting Mul-T with constructs for
data-level parallelism and primitives for placement of data
and tasks. As an example, the programmer can use
future-on which works just like a normal future but
allows  the  specification  of  the  node  on  which  to  schedule 
the  future.  Extending  Mul-T  in  this  way  allows  us  to 
experiment  with  techniques  for  enhancing  locality  and 
to  research  language-level  issues  for  programming  par¬ 
allel  machines. 

3  Processor  Architecture 

APRIL is a pipelined RISC processor extended with
special mechanisms for multiprocessing. This section
gives  an  overview  of  the  APRIL  architecture  and  fo¬ 
cuses  on  its  features  that  support  multithreading,  fine- 
grain  synchronization,  cheap  futures,  and  other  models 
of  computation. 

The  left  half  of  Figure  2  depicts  the  user-visible  pro¬ 
cessor  state  comprising  four  sets  of  general  purpose  reg¬ 
isters,  and  four  sets  of  Program  Counter  (PC)  chains 
and  Processor  State  Registers  (PSR).  The  PC  chain 
represents  the  instruction  addresses  corresponding  to 
a  thread,  and  the  PSR  holds  various  pieces  of  process- 
specific  state.  Each  register  set,  together  with  a  single 
PC-chain  and  PSR,  is  conceptually  grouped  into  a  single 
entity  called  a  task  frame  (using  terminology  from  [8]). 
Only  one  task  frame  is  active  at  a  given  time  and  is 
designated by a current frame pointer (FP). All
register accesses are made to the active register set and
instructions  are  fetched  using  the  active  PC-chain.  Ad¬ 
ditionally,  a  set  of  8  global  registers  that  are  always 
accessible  (regardless  of  the  FP)  is  provided. 

Registers are 32 bits wide. The PSR is also a 32-bit
register and can be read into and written from the
general registers. Special instructions can read and write
the FP register. The PC-chain includes the Program
Counter (PC) and next Program Counter (nPC) which
are not directly accessible. This assumes a single-cycle
branch delay slot. Condition codes are set as a side
effect of compute instructions. A longer branch delay
might be necessary if the branch instruction itself does a
compare so that condition codes need not be saved [13];
in this case the PC chain is correspondingly longer.
Words in memory have a 32-bit data field, and have
an additional synchronization bit called the full/empty
bit.

Figure 2: Processor State and Virtual Threads.

Use  of  multiple  register  sets  on  the  processor,  as  in  the 
HEP,  allows  rapid  context  switching.  A  context  switch 
is  achieved  by  changing  the  frame  pointer  and  empty¬ 
ing  the  pipeline.  The  cache  controller  forces  a  context 
switch  on  the  processor,  typically  on  remote  network  re¬ 
quests,  and  on  certain  unsuccessful  full/empty  bit  syn¬ 
chronizations. 

APRIL implements futures using the trap mechanism.
For  our  proposed  experimental  implementation  based 
on  SPARC,  which  does  not  have  four  separate  PC  and 
PSR  frames,  context  switches  are  also  caused  through 
traps.  Therefore,  a  fast  trap  mechanism  is  essential. 
When  a  trap  is  signalled  in  APRIL,  the  trap  mechanism 
lets  the  pipeline  empty  and  passes  control  to  the  trap 
handler.  The  trap  handler  executes  in  the  same  task 
frame  as  the  thread  that  trapped  so  that  it  can  access 
all  of  the  thread’s  registers. 

3.1  Coarse-Grain  Multithreading 

In most processor designs to date (e.g., [8, 22, 19, 15]),
multithreading has involved cycle-by-cycle interleaving
of threads. Such fine-grain multithreading has been
used to hide memory latency and also to achieve high
pipeline utilization. Pipeline dependencies are avoided
by maintaining instructions from different threads in the
pipeline, at the price of poor single-thread performance.

In the ALEWIFE machine, we are primarily
concerned with the large latencies associated with cache
misses that require a network access. Good
single-thread performance is also important. Therefore
APRIL continues executing a single thread until a
memory operation involving a remote request (or an
unsuccessful synchronization attempt) is encountered. The
controller forces the processor to switch to another
thread, while it services the request. This approach is
called coarse-grain multithreading. Processors in
message passing multicomputers [21, 27, 6, 4] have
traditionally taken this approach to allow overlapping of
communication with computation.

Context switching in APRIL is achieved by changing
the frame pointer. Since APRIL has four task frames,
it can have up to four threads loaded. The thread that
is being executed resides in the task frame pointed to
by the FP. A context switch simply involves letting the
processor pipeline empty while saving the PC-chain and
then changing the FP to point to another task frame.

Threads in ALEWIFE are virtual. Only a small
subset of all threads can be physically resident on the
processors; these threads are called loaded threads. The
remaining threads are referred to as unloaded threads and
live on various queues in memory, waiting their turn
to  be  loaded.  In  a  sense,  the  set  of  task  frames  acts 
like  a  cache  on  the  virtual  threads.  This  organization 
is  illustrated  in  Figure  2.  The  scheduler  tries  to  choose 
threads  from  the  set  of  loaded  threads  for  execution  to 
minimize  the  overhead  of  saving  and  restoring  threads 
to  and  from  memory.  When  control  eventually  passes 
back  to  the  thread  that  suffered  a  remote  request,  the 
controller  should  have  completed  servicing  the  request, 
provided  the  other  threads  ran  for  enough  cycles.  By 
maximizing  local  cache  and  memory  accesses,  the  need 
for  context  switching  reduces  to  once  every  50  or  100 
cycles,  which  allows  us  to  tolerate  latencies  in  the  range 
of 150 to 300 cycles with 4 task frames (see Section 8).
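
The latency range quoted above follows from simple arithmetic, sketched below. The function name tolerable_latency is hypothetical, and the sketch assumes that each of the other loaded threads runs for one full run length before switching and that context-switch overhead is ignored; the analytical model of Section 8 is more detailed.

```python
def tolerable_latency(task_frames, run_length):
    """Cycles of remote latency hidden while the other loaded threads run.

    Assumes every other loaded thread executes `run_length` cycles before
    it switches, and ignores the context-switch overhead (a simplification).
    """
    return (task_frames - 1) * run_length

low = tolerable_latency(task_frames=4, run_length=50)     # 3 * 50
high = tolerable_latency(task_frames=4, run_length=100)   # 3 * 100
print(low, high)
```

With four task frames, three other threads can run while one waits, which gives the 150 to 300 cycle range for run lengths of 50 to 100 cycles.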

Rapid  context  switching  is  used  to  hide  the  latency 
encountered in several other trap events, such as
synchronization faults (or attempts to load from “empty”
locations).  These  events  can  either  cause  the  proces¬ 
sor  to  suspend  execution  (wait)  or  to  take  a  trap.  In 
the  former  case,  the  controller  holds  the  processor  until 
the  request  is  satisfied.  This  typically  happens  on  lo¬ 
cal  memory  cache  misses,  and  on  certain  full/empty  bit 
tests.  If  a  trap  is  taken,  the  trap  handling  routine  can 
respond  by: 

1.  spinning  -  immediately  return  from  the  trap  and 
retry  the  trapping  instruction. 

2.  switch  spinning  -  context  switch  without  unloading 
the  trapped  thread. 

3.  blocking  -  unload  the  thread. 

The  above  alternatives  must  be  considered  with  care 
because  incorrect  choices  can  create  or  exacerbate  star¬ 
vation  and  thrashing  problems.  An  extreme  example 
of starvation is this: all loaded threads are spinning
or  switch  spinning  on  an  exception  condition  that  an 
unloaded  thread  is  responsible  for  fulfilling.  We  are  in¬ 
vestigating  several  possible  mechanisms  to  handle  such 
problems,  including  a  special  controller  initiated  trap 
on  certain  failed  synchronization  tests,  whose  handler 
unloads  the  thread. 
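
One way a trap handler could choose among the three responses is sketched below. This escalation policy is an assumption made for illustration and is not the policy of the APRIL run-time system; the names Response and choose_response and the thresholds spin_limit and switch_limit are hypothetical.

```python
from enum import Enum

class Response(Enum):
    SPIN = 1          # return from the trap and retry the instruction
    SWITCH_SPIN = 2   # context switch, thread stays loaded
    BLOCK = 3         # unload the thread

def choose_response(failed_attempts, spin_limit=2, switch_limit=8):
    """Escalate from spinning to blocking as failures accumulate.

    The thresholds are illustrative values, not measured ones.
    """
    if failed_attempts < spin_limit:
        return Response.SPIN
    if failed_attempts < switch_limit:
        return Response.SWITCH_SPIN
    return Response.BLOCK

history = [choose_response(n) for n in (0, 5, 20)]
```

Because a thread that keeps failing is eventually unloaded, a task frame is freed for an unloaded thread, which addresses the starvation example above in which every loaded thread waits on an unloaded one.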

An important aspect of the ALEWIFE system is its
combination of caches and multithreading. While this
combination is advantageous, it also creates a unique
class of thrashing and starvation problems. For
example, forward progress can be halted if a context
executing on one processor is writing to a location while a
context on another processor is reading from it. These two
contexts can easily play “cache tag”, since writes to a
location force a context switch and invalidation of other
cached copies, while reads force a context switch and
transform read-write copies into read-only copies.
Another problem involves thrashing between an instruction
and its data; a context will be blocked if it has a load
instruction mapped to the same cache line as the
target of the load. These and related problems have been
addressed with appropriate hardware interlock
mechanisms.

3.2  Support  for  Futures 

Executing  a  Mul-T  program  with  futures  incurs  two 
types  of  overhead  not  present  in  sequential  programs. 
First,  strict  operations  must  check  their  operands  for 
availability  before  using  them.  Second,  there  is  a  cost 
associated  with  creating  new  threads. 

Detection  of  Futures  Operand  checks  for  futures 
done  in  software  imply  wasted  cycles  on  every  strict 
operation.  Our  measurements  with  Mul-T  running  on 
an Encore Multimax show that this is expensive. Even
with clever compiler optimizations, there is close to a
factor of two loss in performance over a purely
sequential implementation (see Table 3). Our solution
employs a tagging scheme with hardware-generated traps
if  an  operand  to  a  strict  operator  is  a  future.  We  believe 
that  this  hardware  support  is  necessary  to  make  futures 
a  viable  construct  for  expressing  parallelism.  From  an 
architectural  perspective,  this  mechanism  is  similar  to 
dynamic  type  checking  in  Lisp.  However,  this  mecha¬ 
nism  is  necessary  even  in  a  statically  typed  language  in 
the  presence  of  dynamic  futures. 

APRIL uses a simple data type encoding scheme for
automatically generating a trap when operands to strict
operators are futures. This implementation (discussed
in Section 5) obviates the need to explicitly inspect
in software the operands to every compute instruction.
This is important because we do not want to hurt the
efficiency of all compute instructions because of the
possibility that an operand is a future.

Lazy Task Creation Little can be done to reduce the
cost  of  task  creation  if  future  is  taken  as  a  command 
to  create  a  new  task.  In  many  programs  the  possibility 
of  creating  an  excessive  number  of  fine-grain  tasks  ex¬ 
ists.  Our  solution  to  this  problem  is  called  lazy  task  cre¬ 
ation  [17].  With  lazy  task  creation  a  future  expression 
does  not  create  a  new  task,  but  computes  the  expression 
as  a  local  procedure  call,  leaving  behind  a  marker  indi¬ 
cating  that  a  new  task  could  have  been  created.  The 
new  task  is  created  only  when  some  processor  becomes 
idle  and  looks  for  work,  stealing  the  continuation  of  that 
procedure  call.  Thus,  the  user  can  specify  the  maximum 
possible  parallelism  without  the  overhead  of  creating  a 
large  number  of  tasks.  The  race  conditions  are  resolved 
using  the  fine-grain  locking  provided  by  the  full/empty 
bits. 
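
The idea of lazy task creation can be sketched as a single-threaded Python simulation. This is an illustrative model under stated assumptions, not the Mul-T implementation of [17]: the class Processor and the methods lazy_future and steal_from are hypothetical, a dictionary slot stands in for the future that would hold the child's value, and the full/empty-bit locking that resolves races is not modeled.

```python
from collections import deque

class Processor:
    def __init__(self):
        self.markers = deque()     # continuations that could become tasks
        self.tasks_created = 0

    def lazy_future(self, child, continuation):
        """Run child as a local call; leave the continuation stealable."""
        marker = {"cont": continuation, "slot": {}}
        self.markers.append(marker)
        value = child()                        # ordinary procedure call
        if self.markers and self.markers[-1] is marker:
            self.markers.pop()                 # not stolen: no task created
            return continuation(value)
        marker["slot"]["value"] = value        # stolen: resolve placeholder
        return None                            # continuation runs elsewhere

    def steal_from(self, victim):
        """Idle processor takes the oldest marker and creates a task."""
        if not victim.markers:
            return None
        self.tasks_created += 1
        marker = victim.markers.popleft()
        return marker["cont"], marker["slot"]

busy, idle = Processor(), Processor()
inline = busy.lazy_future(lambda: 2, lambda v: v + 1)   # runs as a plain call

stolen = []
def child():
    stolen.append(idle.steal_from(busy))   # the idle processor steals mid-call
    return 10

busy.lazy_future(child, lambda v: v * 2)
cont, slot = stolen[0]
remote = cont(slot["value"])               # the thief runs the continuation
```

In the first call nothing is stolen, so no task is created and the cost is that of a procedure call plus the marker. A task comes into existence only in the second call, when the idle processor asks for work.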

3.3  Fine-grain  synchronization 

Besides  support  for  lazy  task  creation,  efficient  fine- 
grain  synchronization  is  essential  for  large-scale  parallel 
computing.  Both  the  dataflow  and  data-parallel  models 
of  computation  rely  heavily  on  the  availability  of  cheap 
fine-grain synchronization. The unnecessary
serialization imposed by barriers in MIMD implementations of
data-parallelism can be avoided by allowing fine-grain
word-level synchronization in data structures. The
traditional test&set based synchronization requires extra
memory operations and separate data storage for the
lock and for the associated data. Busy-waiting or
blocking in conventional processors wastes additional
processor cycles.

APRIL adopts the full/empty bit approach used in
the HEP to reduce both the storage requirements and
the number of memory accesses. A bit associated with
each memory word indicates the state of the word: full
or empty. The load of an empty location or the store
into a full location can trap the processor causing a
context switch, which helps hide synchronization delay.
Traps also obviate the additional software tests of the
lock in test&set operations. A similar mechanism is
used to implement I-structures in dataflow machines [3];
however, APRIL is different in that it implements such
synchronizations through software trap handlers.
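
The behavior of a word with a full/empty bit can be modeled in a few lines of Python. This is a sketch of the semantics only: the names Word, SyncFault, load, and store are hypothetical, and a Python exception stands in for the hardware trap that would invoke a software handler.

```python
class SyncFault(Exception):
    """Stands in for the full/empty trap (hypothetical name)."""

class Word:
    def __init__(self):
        self.value = 0
        self.full = False          # the full/empty bit; words start empty here

def load(word, reset=False):
    """Trap on an empty location; optionally reset the bit to empty."""
    if not word.full:
        raise SyncFault("load of empty location")
    value = word.value
    if reset:
        word.full = False
    return value

def store(word, value, set_full=True):
    """Trap on a full location; optionally set the bit to full."""
    if word.full:
        raise SyncFault("store into full location")
    word.value = value
    if set_full:
        word.full = True

w = Word()
try:
    load(w)                        # consumer arrives before the producer
    early = "no trap"
except SyncFault:
    early = "trap"                 # a handler could now switch contexts
store(w, 99)                       # producer fills the word
got = load(w, reset=True)          # consumer reads and empties it
```

The lock and the data share one word, so no separate lock storage or explicit test of a lock is needed on the common path.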

3.4  Multimodel  Support  Mechanisms 

APRIL  is  designed  primarily  for  a  shared-memory  mul¬ 
tiprocessor  with  strongly  coherent  caches.  However, 
we are considering several additional mechanisms which
will permit explicit management of caches and efficient
use  of  network  bandwidth.  These  mechanisms  present 
different  computational  models  to  the  programmer. 

To  allow  software-enforced  cache  coherence,  we  have 
loads  and  stores  that  bypass  the  hardware  coherence 
mechanism,  and  a  flush  operation  that  permits  soft¬ 
ware  writeback  and  invalidation  of  cache  lines.  A  loaded 
context  has  a  fence  counter  that  is  incremented  for 
each  dirty  cache  line  that  is  flushed  and  decremented 
for  each  acknowledgement  from  memory.  This  fence 
counter  may  be  examined  to  determine  if  all  writebacks 
have  completed.  We  are  proposing  a  block-transfer 
mechanism  for  efficient  transfer  of  large  blocks  of  data. 
Finally,  we  are  considering  an  interprocessor-interrupt 
mechanism  (IPI)  which  permits  preemptive  messages 
to be sent to specific processors. IPIs offer reasonable
alternatives  to  polling  and,  in  conjunction  with  block- 
transfers,  form  a  primitive  for  the  message-passing  com¬ 
putational  model. 
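
The fence counter described above amounts to counting outstanding writebacks. A minimal sketch follows; the class FenceCounter and its method names are hypothetical and only mirror the increment, decrement, and examine steps given in the text.

```python
class FenceCounter:
    """Per-context count of flushed dirty lines not yet acknowledged."""
    def __init__(self):
        self.pending = 0

    def flush_dirty_line(self):
        self.pending += 1          # a dirty cache line was flushed

    def ack_from_memory(self):
        self.pending -= 1          # memory acknowledged one writeback

    def writebacks_complete(self):
        return self.pending == 0   # software examines the counter

fence = FenceCounter()
fence.flush_dirty_line()
fence.flush_dirty_line()
fence.ack_from_memory()
before = fence.writebacks_complete()   # one writeback still outstanding
fence.ack_from_memory()
after = fence.writebacks_complete()
```

Software that enforces coherence itself would wait until the counter reads zero before assuming that other nodes can see the flushed data.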

Although  each  of  these  mechanisms  adds  complex¬ 
ity  to  our  cache  controller,  they  are  easily  implemented 
in  the  processor  through  “out-of-band”  instructions  as 
discussed  in  Section  5. 

4  Instruction  Set 

APRIL  has  a  basic  RISC  instruction  set  augmented 
with  special  memory  instructions  for  full/empty  bit  op¬ 
erations,  multithreading,  and  cache  support.  The  at¬ 
traction  of  an  implementation  based  on  simple  SPARC 
processor  modifications  has  resulted  in  a  basic  SPARC- 
like  design.  All  registers  are  addressed  relative  to  a  cur¬ 
rent  frame  pointer.  Compute  instructions  are  3-address 
register-to-register arithmetic/logic operations.
Conditional branch instructions take an immediate operand
and  may  increment  the  PC  by  the  value  of  the  immedi¬ 
ate  operand  depending  on  the  condition  codes  set  by  the 
arithmetic/logic  operations.  Memory  instructions  move 
data  between  memory  and  the  registers,  and  also  inter¬ 
act  with  the  cache  and  the  full/empty  bits.  The  basic 
instruction  categories  are  summarized  in  Table  1.  The 
remainder  of  this  section  describes  features  of  APRIL 
instructions  used  for  supporting  multiprocessing. 

Data Type Formats APRIL supports tagged pointers
for Mul-T, as in the Berkeley SPUR processor [12],
by encoding the pointer type in the low order bits of a
data word. Associating the type with the pointer has
the advantage of saving an additional memory reference
when accessing type information. Figure 3 lists the
different type encodings. An important purpose of this
type encoding scheme is to support hardware detection
of futures.

Type     Format          Data transfer    Control flow
Compute  op s1 s2 d      d ← s1 op s2     PC+1
Memory   ld type a d     d ← mem[a]       PC+1
         st type d s     mem[a] ← s       PC+1
Branch   jcond offset                     if cond PC+offset else PC+1
         jmpl offset d   d ← PC           PC+offset

Table 1: Basic instruction set summary.

Figure 3: Data Type Encodings (Fixnum, Other, Cons, Future; bit 31 to bit 0).

Future Detection and Compute Instructions Since a
compute instruction is a strict operation, special
action  has  to  be  taken  if  either  of  its  operands  is  a  fu¬ 
ture.  APRIL  generates  a  trap  if  a  future  is  encountered 
by  a  compute  instruction.  Future  pointers  are  easily 
detected  by  their  non-zero  least  significant  bit. 
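
The least-significant-bit check can be sketched as follows. The names FutureTrap, is_future, and compute are hypothetical, integers stand in for 32-bit tagged words, and the only encoding fact used is the one stated above, namely that future pointers have a non-zero least significant bit.

```python
import operator

class FutureTrap(Exception):
    """Stands in for the hardware trap on a future operand (hypothetical)."""

def is_future(word):
    # Future pointers are the words whose least significant bit is set.
    return (word & 1) == 1

def compute(op, s1, s2):
    """Apply a strict operation, trapping if either operand is a future."""
    if is_future(s1) or is_future(s2):
        raise FutureTrap("operand is a future; handler must touch it")
    return op(s1, s2)

total = compute(operator.add, 8, 12)       # both words have LSB 0
try:
    compute(operator.add, 8, 0b1101)       # second word has LSB 1
    outcome = "no trap"
except FutureTrap:
    outcome = "trap"
```

In hardware the check costs no extra instructions, which is the point of the scheme: operations on non-future operands run at full speed.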

Memory  Instructions  Memory  instructions  are 
complex  because  they  interact  with  the  full/empty  bits 
and  the  cache  controller.  On  a  memory  access,  two  data 
exceptions  can  occur:  the  accessed  location  may  not  be 
in  the  cache  (a  cache  miss),  and  the  accessed  location 
may  be  empty  on  a  load  or  full  on  a  store  (a  full/empty 
exception).  On  a  cache  miss,  the  cache/directory  con¬ 
troller  can  trap  the  processor  or  make  the  processor 
wait  until  the  data  is  available.  On  full/empty  excep¬ 
tions,  the  controller  can  trap  the  processor,  or  allow  the 
processor  to  continue  execution.  Load  instructions  also 
have  the  option  of  setting  the  full/empty  bit  of  the  ac¬ 
cessed  location  to  empty  while  store  instructions  have 
the  option  of  setting  the  bit  to  full.  These  options  give 
rise  to  8  kinds  of  loads  and  8  kinds  of  stores.  The  load 
instructions  are  listed  in  Table  2.  Store  instructions  are 
similar  except  that  they  trap  on  full  locations  instead 
of  empty  locations. 
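
The count of 8 load kinds is the product of three two-way options, which the short sketch below enumerates. The option names in the dictionary are descriptive labels chosen here, not APRIL mnemonics.

```python
from itertools import product

# Three independent two-way options described in the text.
options = {
    "reset_fe_bit": (False, True),     # set the full/empty bit to empty?
    "trap_on_empty": (True, False),    # trap or continue on an empty word
    "cache_miss": ("trap", "wait"),    # controller response to a miss
}

load_kinds = [dict(zip(options, combo)) for combo in product(*options.values())]
print(len(load_kinds))
```

Stores have the same three options with "full" in place of "empty", which gives the 8 kinds of stores.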

A memory instruction also shares responsibility for
detecting futures in either of its address operands. Like
compute instructions, memory instructions also trap
if the least significant bit of either of their address
operands is non-zero. This introduces the restriction
that objects in memory cannot be allocated at byte
boundaries. This, however, is not a problem because
object allocation at word boundaries is favored for other
reasons [11]. This trap provides support for implicit
future touches in operators that dereference pointers, e.g.,
car in LISP.

Name   Type  Reset f/e bit  EL¹ trap  CM² response
ldtt   1     No             Yes       Trap
ldett  2     Yes            Yes       Trap
ldnt   3     No             No        Trap
ldent  4     Yes            No        Trap
ldnw   5     No             No        Wait
ldenw  6     Yes            No        Wait
ldtw   7     No             Yes       Wait
ldetw  8     Yes            Yes       Wait

¹Empty location. ²Cache miss.

Table 2: Load Instructions.

Full/Empty  Bit  Conditional  Branch  Instructions 

Non-trapping  memory  instructions  allow  testing  of  the 
full/empty  bit  by  setting  a  condition  bit  indicating  the 
state  of  the  memory  word’s  full/empty  bit.  APRIL 
provides  conditional  branch  instructions,  Jfull  and 
Jempty,  that  dispatch  on  this  condition  bit.  This  pro¬ 
vides  a  mechanism  to  explicitly  control  the  action  taken 
following  a  memory  instruction  that  would  normally
trap  on  a  full/empty  exception. 

Frame  Pointer  Instructions  Instructions  are  pro¬ 
vided  for  manipulating  the  register  frame  pointer  (FP). 
FP  points  to  the  register  frame  on  which  the  currently 
executing  thread  resides.  An  INCFP  instruction  incre¬ 
ments  the  FP  to  point  to  the  next  task  frame  while 
a  DECFP  instruction  decrements  it.  The  incrementing 
and  decrementing  is  done  modulo  the  number  of  task 
frames.  RDFP  reads  the  value  of  the  FP  into  a  register 
and  STFP  writes  the  contents  of  a  register  into  the  FP. 
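
The modulo behaviour of the frame pointer instructions can be sketched as follows. This is a minimal Python model, not the hardware definition: the class name is hypothetical, the default of four task frames is taken from the SPARC-based implementation described in Section 5, and wrapping an out-of-range value written by STFP is an assumption.

```python
class FramePointer:
    """Toy model of the register frame pointer (FP)."""

    def __init__(self, num_frames: int = 4):
        self.num_frames = num_frames
        self.fp = 0

    def incfp(self) -> None:
        # INCFP: advance to the next task frame, wrapping around.
        self.fp = (self.fp + 1) % self.num_frames

    def decfp(self) -> None:
        # DECFP: step back to the previous task frame, wrapping around.
        self.fp = (self.fp - 1) % self.num_frames

    def rdfp(self) -> int:
        # RDFP: read the FP into a (modelled) register.
        return self.fp

    def stfp(self, value: int) -> None:
        # STFP: write a register value into the FP (wrapping is assumed).
        self.fp = value % self.num_frames
```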

Instructions  for  Other  Mechanisms  The  special 
mechanisms  discussed  in  Section  3.4,  such  as  FLUSH,
are  made  available  through  “out-of-band”  instructions. 
Interprocessor-interrupts,  block-transfers,  and  FENCE 
operations  are  initiated  via  memory-mapped  I/O  in¬ 
structions  (LDIO,  STIO). 


5  An  Implementation  of  APRIL 

An  ALEWIFE  node  consists  of  several  interacting
subsystems:  processor,  floating-point  unit,  cache,  memory,
cache  and  directory  controller,  and  network  controller.
For  the  first-round  implementation  of  the  ALEWIFE
system,  we  plan  to  use  a  modified  SPARC  processor
and  an  unmodified  SPARC  floating-point  unit.¹  There
are  several  reasons  for  this  choice.  First,  we  have  chosen
to  devote  our  limited  resources  to  the  design  of  a  custom
ALEWIFE  cache  and  directory  controller,  rather  than
to  processor  design.  Second,  the  register  windows  in
the  SPARC  processor  permit  a  simple  implementation
of  coarse-grain  multithreading.  Third,  most  of  the
instructions  envisioned  for  the  original  APRIL  processor
map  directly  to  single  or  double  instruction  sequences
on  the  SPARC.  Software  compatibility  with  a  commercial
processor  allows  easy  access  to  a  large  body  of  software.
Furthermore,  use  of  a  standard  processor  permits
us  to  ride  the  technology  curve;  we  can  take  advantage
of  new  technology  as  it  is  developed.

Rapid  Context  Switching  on  SPARC  SPARC 
processors  contain  an  implementation-dependent  num¬ 
ber  of  overlapping  register  windows  for  speeding  up  pro¬ 
cedure  calls.  The  current  register  window  is  altered 
via  SPARC  instructions  (SAVE  and  RESTORE)  that  mod¬ 
ify  the  Current  Window  Pointer  (CWP).  Traps  incre¬ 
ment  the  CWP,  while  the  trap  return  instruction  (RETT) 
decrements  it.  SPARC’s  register  windows  are  suited  for 
rapid  context  switching  and  rapid  trap  handling  because 
most  of  the  state  of  a  process  (i.e.,  its  24  local
registers)  can  be  switched  with  a  single-cycle  instruction.
Although  we  are  not  using  multiple  register  windows  for
procedure  calls  within  a  single  thread,  this  should  not
significantly  hurt  performance  [25,  24].

To  implement  coarse-grain  multithreading,  we  use 
two  register  windows  per  task  frame  -  a  user  window 
and  a  trap  window.  The  SPARC  processor  chosen  for 
our  implementation  has  eight  register  windows,  allow¬ 
ing  a  maximum  of  four  hardware  task  frames.  Since 
the  SPARC  does  not  have  multiple  program  counter 
(PC)  chains  and  processor  status  registers  (PSR),  our 
trap  code  must  explicitly  save  and  restore  the  PSRs 
during  context  switches  (the  PC  chain  is  saved  by  the 
trap  itself).  These  values  are  saved  in  the  trap  window. 
Because  the  SPARC  has  a  minimum  trap  overhead  of
five  cycles  (for  squashing  the  pipeline  and  computing
the  trap  vector),  context  switches  will  take  at  least  this
long.  See  Section  6.1  for  further  information.

¹  The  SPARC-based  implementation  effort  is  in  collaboration
with  LSI  Logic  Corporation.

The  SPARC  floating-point  unit  does  not  support  register
windows,  but  has  a  single,  32-word  register  file.
To  retain  rapid  context  switching  ability  for  applications
that  require  efficient  floating  point  performance,
we  have  divided  the  floating  point  register  file  into  four
sets  of  eight  registers.  This  is  achieved  by  modifying
floating-point  instructions  in  a  context  dependent  fashion
as  they  are  loaded  into  the  FPU  and  by  maintaining
four  different  sets  of  condition  bits.  A  modification  of
the  SPARC  processor  will  make  the  CWP  available  externally
to  allow  insertion  into  the  FPU  instruction.
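
One way to picture the partitioning of the floating-point register file is the mapping below. It is only a sketch under stated assumptions: the text does not say how the CWP is inserted into the FPU instruction, so deriving the task frame as CWP divided by two (two windows per frame) and using it as the high-order part of the register number is a hypothetical encoding, as are all names in the code.

```python
FP_REGISTERS = 32                                # single 32-word FP register file
TASK_FRAMES = 4                                  # four hardware task frames
REGS_PER_FRAME = FP_REGISTERS // TASK_FRAMES     # four sets of eight registers
WINDOWS_PER_FRAME = 2                            # user window + trap window


def physical_fp_register(cwp: int, logical_reg: int) -> int:
    """Map a per-thread FP register number to a slot in the shared file."""
    if not 0 <= logical_reg < REGS_PER_FRAME:
        raise ValueError("each context sees only eight FP registers")
    # Assumed encoding: the task frame selects which set of eight is used.
    frame = (cwp // WINDOWS_PER_FRAME) % TASK_FRAMES
    return frame * REGS_PER_FRAME + logical_reg
```

Under this assumed mapping, the user and trap windows of one task frame share the same eight floating-point registers, while different task frames never collide.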

Support  for  Futures  We  detect  futures  on  the 
SPARC  via  two  separate  mechanisms.  Future  point¬ 
ers  are  tagged  with  their  lowest  bit  set.  Thus,  direct 
use  of  a  future  pointer  is  flagged  with  a  word-alignment 
trap.  Furthermore,  a  strict  operation,  such  as  subtrac¬ 
tion,  applied  to  one  or  more  future  pointers  is  flagged 
with  a  modified  non-fixnum  trap,  that  is  triggered  if  an
operand  has  its  lowest  bit  set  (as  opposed  to  either  one 
of  the  lowest  two  bits,  in  the  SPARC  specification). 
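
The difference between the two tag checks can be sketched in a few lines of Python. The function names are hypothetical; the stock check follows the SPARC tagged-arithmetic rule quoted above (either of the lowest two bits of an operand), and the modified check follows the lowest-bit rule described for APRIL.

```python
def is_future(word: int) -> bool:
    # Future pointers are tagged with their lowest bit set.
    return (word & 0b1) == 1


def stock_tag_trap(a: int, b: int) -> bool:
    # Stock SPARC rule: trap if either of the lowest two bits
    # of either operand is set.
    return ((a | b) & 0b11) != 0


def modified_tag_trap(a: int, b: int) -> bool:
    # Modified rule: trap only if the lowest bit of an operand is set,
    # i.e. only when an operand is a future.
    return ((a | b) & 0b1) != 0
```

For example, an operand whose low bits are `10` would trap under the stock rule but not under the modified one, which leaves that tag value free for other non-future objects.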

Implementation  of  Loads  and  Stores  The 
SPARC  definition  includes  the  Alternate  Space  Indicator
(ASI)  feature  that  permits  a  simple  implementation
of  APRIL’s  many  load  and  store  instructions  (described
in  Section  4).  The  ASI  is  available  externally  as
an  eight-bit  field.  Normal  memory  accesses  use  four  of
the  256  ASI  values  to  indicate  user/supervisor  and
instruction/data  accesses.  Special  SPARC  load  and  store
instructions  (LDASI  and  STASI)  permit  use  of  the  other 
252  ASI  values.  Our  first-round  implementation  uses 
different  ASI  values  to  distinguish  between  flavors  of 
load  and  store  instructions,  special  mechanisms,  and 
I/O  instructions. 

Interaction  with  the  Cache  Controller  The  cache 
controller  in  the  ALEWIFE  system  maintains  strong 
cache  coherence,  performs  full/empty  bit  synchroniza¬ 
tion,  and  implements  special  mechanisms.  By  examin¬ 
ing  the  processor's  ASI  bits  during  memory  accesses, 
it  can  select  between  different  load/store  and  synchro¬ 
nization  behavior,  and  can  determine  if  special  mecha¬ 
nisms  should  be  employed.  Through  use  of  the  Memory 
Exception  (MEXC)  line  on  SPARC,  it  can  invoke  syn¬ 
chronous  traps  corresponding  to  cache  misses  and  syn¬ 
chronization  (full/empty)  mismatches.  The  controller 
can  suspend  processor  execution  using  the  MHOLD
line.  It  passes  condition  information  to  the  processor 
through  the  Coprocessor  Condition  bits  (CCCs),  per¬ 
mitting  the  full/empty  conditional  branch  instructions 
(Jfull  and  Jempty)  to  be  implemented  as  coprocessor 
branch  instructions.  Asynchronous  traps  (IPIs)  are
delivered  via  the  SPARC’s  asynchronous  trap  lines.


6  Compiler  and  Run-Time  System

The  compiler  and  run-time  system  are  integral  parts
of  the  processor  design  effort.  A  Mul-T  compiler  for
APRIL  and  a  run-time  system  written  partly  in  APRIL
assembly  code  and  partly  in  T  have  been  implemented.
Constructs  for  user-directed  placement  of  data  and  processes
have  also  been  implemented.  The  run-time  system
includes  the  trap  and  system  routines,  Mul-T  run-time
support,  a  scheduler,  and  a  system  boot  routine.

Since  a  large  portion  of  the  support  for  multithreading,
synchronization  and  futures  is  provided  in  software
through  traps  and  run-time  routines,  trap  handling
must  be  fast.  Below,  we  describe  the  implementation
and  performance  of  the  routines  used  for  trap
handling  and  context  switching.

6.1  Cache  Miss  and  Full/Empty  Traps 

Cache  miss  traps  occur  on  cache  misses  that  require 
a  network  request  and  cause  the  processor  to  context 
switch.  Full/empty  synchronization  exceptions  can  occur
on  certain  memory  instructions  described  in  Section  4.
The  processor  can  respond  to  these  exceptions
by  spinning,  switch  spinning,  or  blocking  the  thread.
In  our  current  implementation,  traps  handle  these  exceptions
by  switch  spinning,  which  involves  a  context
switch  to  the  next  task  frame. 

In  our  SPARC-based  design  of  APRIL,  we  implement 
context  switching  through  the  trap  mechanism  using 
instructions  that  change  the  CWP.  The  following  is  a 
trap  routine  that  context  switches  to  the  thread  in  the 
next  task  frame. 

    rdpsr  psrreg    ;  save  PSR  into  a  reserved  reg.
    save             ;  increment  the  window  pointer
    save             ;  by  2
    wrpsr  psrreg    ;  restore  PSR  for  the  new  context
    jmpl   r17       ;  return  from  trap  and
    rett   r18       ;  reexecute  trapping  instruction

We  count  5  cycles  for  the  trap  mechanism  to  allow 
the  pipeline  to  empty  and  save  relevant  processor  state 
before  passing  control  to  the  trap  handler.  The  above 
trap  handler  takes  an  additional  6  cycles  for  a  total  of  11
cycles  to  effect  the  context  switch.  In  a  custom  APRIL 
implementation,  the  cycles  lost  due  to  PC  saves  in  the 
hardware  trap  sequence,  and  those  in  calling  the  trap 
handler  for  the  PSR  saves/restores  and  double  incre¬ 
menting  the  frame  pointer  could  be  avoided,  allowing  a 
four-cycle  context  switch. 
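
The window arithmetic and cycle count of the trap routine above can be traced with a small Python model. It is a sketch under several assumptions: each task frame is taken to occupy an even-numbered user window followed by an odd-numbered trap window, the direction of CWP movement follows the convention used in this paper (traps and SAVE increment, RETT decrements), and the class and helper names are hypothetical.

```python
NUM_WINDOWS = 8          # register windows on the chosen SPARC
WINDOWS_PER_FRAME = 2    # user window + trap window per task frame
TRAP_ENTRY_CYCLES = 5    # pipeline squash and trap vector computation
HANDLER_CYCLES = 6       # the six instructions of the trap routine


class Processor:
    def __init__(self):
        self.cwp = 0                         # user window of task frame 0
        self.psr = "psr-of-frame-0"
        # psrreg lives in each frame's trap window (odd-numbered windows).
        self.psrreg = [None] * NUM_WINDOWS
        for frame in range(1, NUM_WINDOWS // WINDOWS_PER_FRAME):
            self.psrreg[frame * WINDOWS_PER_FRAME + 1] = f"psr-of-frame-{frame}"

    def _step(self, delta: int) -> None:
        self.cwp = (self.cwp + delta) % NUM_WINDOWS


def switch_context(cpu: Processor) -> int:
    """Run the switch-spin trap routine; return the cycles it costs."""
    cpu._step(+1)                      # trap entry: into this frame's trap window
    cpu.psrreg[cpu.cwp] = cpu.psr      # rdpsr psrreg
    cpu._step(+1)                      # save
    cpu._step(+1)                      # save: now in the next frame's trap window
    cpu.psr = cpu.psrreg[cpu.cwp]      # wrpsr psrreg
    cpu._step(-1)                      # jmpl/rett: back into a user window
    return TRAP_ENTRY_CYCLES + HANDLER_CYCLES
```

The net effect of one call is that the CWP advances by two windows, that is, by one task frame, at a modelled cost of 5 + 6 = 11 cycles.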




6.2  Future  Touch  Trap 

When  a  future  touch  trap  is  signalled,  the  future  that
caused  the  trap  will  be  in  a  register.  The  trap  handler
has  to  decode  the  trapping  instruction  to  find  that
register.  The  future  is  resolved  if  the  full/empty  bit  of
the  future's  value  slot  is  set  to  full.  If  it  is  resolved,
the  future  in  the  register  is  replaced  with  the  resolved
value;  otherwise  the  trap  routine  can  decide  to  switch
spin  or  block  the  thread  that  trapped.  Our  future  touch
trap  handler  takes  23  cycles  to  execute  if  the  future  is
resolved.
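
The decision made by the future touch trap handler can be summarized in a short Python sketch. The data layout, the names, and the `block_when_unresolved` policy flag are hypothetical; decoding the trapping instruction to locate the register is replaced here by passing the register name directly.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "re-execute the trapping instruction"
    SWITCH_SPIN = "context switch to the next task frame"
    BLOCK = "unload the thread"


@dataclass
class Future:
    value: object = None
    full: bool = False   # full/empty bit of the future's value slot


def future_touch_trap(registers: dict, reg: str,
                      block_when_unresolved: bool = False) -> Action:
    """Handle a touch of the future held in registers[reg]."""
    future = registers[reg]
    if future.full:
        # Resolved: replace the future in the register with its value.
        registers[reg] = future.value
        return Action.CONTINUE
    # Unresolved: the handler chooses between switch spinning and blocking.
    return Action.BLOCK if block_when_unresolved else Action.SWITCH_SPIN
```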

If  the  trap  handler  decides  to  block  the  thread  on  an 
unresolved  future,  the  thread  must  be  unloaded  from 
the  hardware  task  frame,  and  an  alternate  thread  may 
be  loaded.  Loading  a  thread  involves  writing  the  state  of 
the  thread,  including  its  general  registers,  its  PC  chain, 
and  its  PSR,  into  a  hardware  task  frame  on  the  processor,
and  unloading  a  thread  involves  saving  the  state
of  a  thread  out  to  memory.  Loading  and  unloading
threads  are  expensive  operations  unless  there  is  special
hardware  support  for  block  movement  of  data  between
registers  and  memory.  Since  the  scheduling  mechanism
favors  processor-resident  threads,  loading  and  unloading
of  threads  should  be  infrequent.  However,  this  is  an
issue  that  is  under  investigation.

7  Performance  Measurements 

This  section  presents  some  results  on  APRIL’s  performance
in  handling  fine-grain  tasks.  We  have  implemented
a  simulator  for  the  ALEWIFE  system  written
in  C  and  T.  Figure  4  illustrates  the  organization  of  the
simulator.  The  Mul-T  compiler  produces  APRIL  code,
which  gets  linked  with  the  run-time  system  to  yield  an
executable  program.  The  instruction-level  APRIL  processor
simulator  interprets  APRIL  instructions.  It  is
written  in  T  and  simulates  40,000  APRIL  instructions
per  second  when  run  on  a  SPARCServer  330.  The  processor
simulator  interacts  with  the  cache  and  directory
simulator  (written  in  C)  on  memory  instructions.  The
cache  simulator  in  turn  interacts  with  the  network  simulator
(also  written  in  C)  when  making  remote  memory
operations.  The  simulator  has  proved  to  be  a  useful
tool  in  evaluating  system-wide  architectural  tradeoffs
as  it  provides  more  accurate  results  than  a  trace  driven
simulation.  The  speed  of  the  simulator  has  allowed  us
to  execute  lengthy  parallel  programs.  As  an  example,  in
a  run  of  speech  (described  below),  the  simulated  program
ran  for  100  million  simulated  cycles  before  completing.

Figure  4:  Simulator  Organization.

Evaluation  of  the  ALEWIFE  architecture  through
simulations  is  in  progress.  A  sampling  of  our  results  on
the  performance  of  APRIL  running  parallel  programs
is  presented  here.  Table  3  lists  the  execution  times  of
four  programs  written  in  Mul-T:  fib,  factor,  queens
and  speech.  fib  is  the  ubiquitous  doubly  recursive
Fibonacci  program  with  'future's  around  each  of  its
recursive  calls.  factor  finds  the  largest  prime  factor  of
each  number  in  a  range  of  numbers  and  sums  them  up.
queens  finds  all  solutions  to  the  n-queens  chess  problem
for  n  =  8  and  speech  is  a  modified  Viterbi  graph
search  algorithm  used  in  a  connected  speech  recognition
system  called  SUMMIT,  developed  by  the  Spoken  Language
Systems  Group  at  MIT.  We  ran  each  program
on  the  Encore  Multimax,  on  APRIL  using  normal  task
creation,  and  on  APRIL  using  lazy  task  creation.  For
purposes  of  comparison,  execution  time  has  been  normalized
to  the  time  taken  to  execute  a  sequential  version
of  each  program,  i.e.,  with  no  futures  and  compiled  with
an  optimizing  T-compiler.

The  difference  between  running  the  same  sequential
code  on  T  and  on  Mul-T  on  the  Encore  Multimax
(columns  “T  seq”  and  “Mul-T  seq”)  is  due  to  the  overhead
of  future  detection.  Since  the  Encore  does  not
support  hardware  detection  of  futures,  an  overhead  of  a
factor  of  2  is  introduced,  even  though  no  futures  are
actually  created.  There  is  no  overhead  on  APRIL,  which
demonstrates  the  advantage  of  tag  support  for  futures.

The  difference  between  running  sequential  code  on
Mul-T  and  running  parallel  code  on  Mul-T  with  one
processor  (“Mul-T  seq”  and  1)  is  due  to  the  overhead
of  thread  creation  and  synchronization  in  a  parallel
program.  This  overhead  is  very  large  for  the  fib  benchmark
on  both  the  Encore  and  APRIL  using  normal  task  creation
because  of  very  fine-grain  thread  creation.  This
overhead  accounts  for  approximately  a  factor  of  28  in
execution  time.  For  APRIL  with  normal  futures,  this
overhead  accounts  for  a  factor  of  14.  Lazy  task  creation
on  APRIL  creates  threads  only  when  the  machine
has  the  resources  to  execute  them,  and  performs  much
better  because  it  has  the  effect  of  dynamically  partitioning
the  program  into  coarser-grain  threads  and  creating
fewer  futures.  The  overhead  introduced  is  only  a  factor
of  1.5.  In  all  of  the  programs,  APRIL  consistently
demonstrates  lower  overhead  due  to  support  for  thread
creation  and  synchronization  over  the  Encore.

Measurements  for  multiple  processor  executions  on
APRIL  (2  -  16)  used  the  processor  simulator  without
the  cache  and  network  simulators,  in  effect  simulating  a
shared-memory  machine  with  no  memory  latency.  The
numbers  demonstrate  that  APRIL  and  its  run-time  system
allow  parallel  program  performance  to  scale  when
synchronization  and  task  creation  overheads  are  taken
into  account,  but  when  memory  latency  is  ignored.  The
effect  of  communication  in  large-scale  machines  depends
on  several  factors  such  as  scheduling,  which  are  active
areas  of  investigation.

8  Scalability  of  Multithreaded 
Processor  Systems 

Multithreading  enhances  processor  efficiency  by  allowing
execution  to  proceed  on  alternate  threads  while  the
memory  requests  of  other  threads  are  being  satisfied.
However,  any  new  mechanism  is  useful  only  if  it  enhances
overall  system  performance.  This  section  analyzes
the  system  performance  of  multithreaded  processors.

A  multithreaded  processor  design  must  address  the
tradeoff  between  reduced  processor  idle  time  and  increased
cache  miss  rates,  network  contention,  and  context
management  overhead.  The  private  working  sets  of
multiple  contexts  interfere  in  the  cache.  The  added
interference  misses  coupled  with  the  higher  average  traffic
generated  by  a  more  highly  utilized  processor  impose  greater
bandwidth  demands  on  the  interconnection  network.
Context  management  instructions  required  to  switch
the  processor  between  threads  also  add  to  the  overhead.
Furthermore,  the  application  must  display  sufficient
parallelism  to  allow  multiple  thread  assignment
to  each  processor.

What  is  a  good  performance  metric  to  evaluate  multithreading?
A  good  measure  of  system  performance  is
system  power,  which  is  the  product  of  the  number  of
processors  and  the  average  processor  utilization.  Provided
the  computation  of  processor  utilization  takes  into
account  the  deleterious  effects  of  cache,  network,  and
context-switching  overhead,  the  processor  utilization  is
itself  a  good  measure.

We  have  developed  a  model  for  multithreaded  processor
utilization  that  includes  the  cache,  network,  and
switching  overhead  effects.  A  detailed  analysis  is  presented
in  [1].  This  section  will  summarize  the  model
and  our  chief  results.  Processor  utilization  U  as  a  function
of  the  number  of  threads  resident  on  a  processor
p  is  derived  as  a  function  of  the  cache  miss  rate  m(p),
the  network  latency  T(p),  and  the  context  switching
overhead  C:


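\[
U(p) =
\begin{cases}
\dfrac{p}{1 + T(p)\,m(p)} & \text{for } p < \dfrac{1 + T(p)\,m(p)}{1 + C\,m(p)} \\[2ex]
\dfrac{1}{1 + C\,m(p)} & \text{otherwise}
\end{cases}
\tag{1}
\]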


When  the  number  of  threads  is  small,  complete  over¬ 
lapping  of  network  latency  is  not  possible.  Processor 
utilization  with  one  thread  is  1/(1  +  m(1)T(1)).  Ideally,
with  p  threads  available  to  overlap  network  delays,  the
utilization  would  increase  p-fold.  In  practice,  because
the  miss  rate  and  network  latency  increase  to  m(p)  and
T(p),  the  utilization  becomes  p/(1  +  m(p)T(p)).

When  it  is  possible  to  completely  overlap  network 
latency,  processor  utilization  is  limited  only  by  the  con¬ 
text  switching  overhead  paid  on  every  miss  (assuming 
a  context  switch  happens  on  a  cache  miss),  and  is  given 
by  1/(1  +  m(p)C).
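
The two regimes of the utilization model can be evaluated with the Python sketch below. The function name and calling convention are hypothetical. The usage lines hold the miss rate and latency constant at the default values (2% miss rate, 55-cycle latency, 10-cycle switch overhead), which is a simplifying assumption: in the full model both m(p) and T(p) grow with p, and those growth terms are not reproduced here.

```python
from typing import Callable


def utilization(p: int,
                m: Callable[[int], float],
                T: Callable[[int], float],
                C: float) -> float:
    """Processor utilization U(p) for p resident threads.

    m(p): cache miss rate, T(p): network latency in cycles,
    C: context switching overhead in cycles.
    """
    miss, latency = m(p), T(p)
    # Number of threads at which network latency is fully overlapped.
    saturation = (1 + latency * miss) / (1 + C * miss)
    if p < saturation:
        return p / (1 + latency * miss)   # latency only partially hidden
    return 1 / (1 + C * miss)             # limited by switching overhead


# Constant m and T (an assumption): no cache or network interference.
u1 = utilization(1, lambda p: 0.02, lambda p: 55.0, C=10.0)
u2 = utilization(2, lambda p: 0.02, lambda p: 55.0, C=10.0)
```

With these constant inputs the saturation point is 2.1/1.2 = 1.75 threads, so one thread gives 1/2.1 and two or more threads give 1/1.2; the interference terms of the full model lower these figures.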

The  models  for  the  cache  and  network  terms  have 
been  validated  through  simulations.  Both  these  terms 
are  shown  to  be  the  sum  of  two  components:  one  com¬ 
ponent  independent  of  the  number  of  threads  p  and  the 
other  linearly  related  to  p  (to  first  order).  Multithread¬ 
ing  is  shown  to  be  useful  when  p  is  small  enough  that 
the  fixed  components  dominate. 

Let  us  look  at  some  results  for  the  default  set  of  sys¬ 
tem  parameters  given  in  Table  4.  The  analysis  assumes 
8000  processors  arranged  in  a  three  dimensional  array. 
In  such  a  system,  the  average  number  of  hops  between  a 
random  pair  of  nodes  is  nk/3  =  20,  where  n  denotes  network
dimension  and  k  its  radix.  This  yields  an  average
round  trip  network  latency  of  55  cycles  for  an  unloaded 
network,  when  memory  latency  and  average  packet  size 
are  taken  into  account.  The  fixed  miss  rate  comprises 
first-time  fetches  of  blocks  into  the  cache,  and  the  inter¬ 
ference  due  to  multiprocessor  coherence  invalidations. 
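
The processor count and the average hop count quoted above can be checked with a few lines of Python. The sketch assumes an array without wraparound links and a random pair drawn with replacement (so a node may be paired with itself); both are assumptions, and the function name is hypothetical. It compares the exact mean distance with the nk/3 approximation.

```python
from fractions import Fraction


def mean_hops(n: int, k: int) -> Fraction:
    """Exact mean Manhattan distance between two uniformly random nodes
    of an n-dimensional array with k nodes per dimension (no wraparound)."""
    per_dimension = Fraction(
        sum(abs(i - j) for i in range(k) for j in range(k)), k * k)
    return n * per_dimension


n, k = 3, 20
processors = k ** n            # 20 * 20 * 20 nodes
exact = mean_hops(n, k)        # exact average under the assumptions above
approx = Fraction(n * k, 3)    # the nk/3 estimate used in the text
```

Under these assumptions the exact mean is 19.95 hops, so the nk/3 estimate of 20 is accurate to within a fraction of a percent.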


Parameter                  Value
Memory latency             10 cycles
Network dimension n        3
Network radix k            20
Fixed miss rate            2%
Average packet size        4
Cache block size           16 bytes
Thread working set size    250 blocks
Cache size                 64 Kbytes

Table  4:  Default  system  parameters.

Figure  5:  Relative  sizes  of  the  cache,  network  and  overhead
components  that  affect  processor  utilization.


Figure  5  displays  processor  utilization  as  a  function  of 
the  number  of  threads  resident  on  the  processor  when 
context  switching  overhead  is  10  cycles.  The  degree 
to  which  the  cache,  network,  and  overhead  components 
impact  overall  processor  utilization  is  also  shown.  The 
ideal  curve  shows  the  increase  in  processor  utilization 
when  both  the  cache  miss  rate  and  network  contention 
correspond  to  that  of  a  single  process,  and  do  not  in¬ 
crease  with  the  degree  of  multithreading  p. 

We  see  that  as  few  as  three  processes  yield  close  to 
80%  utilization  for  a  ten-cycle  context-switch  overhead 
which  corresponds  to  our  initial  SPARC-based  imple¬ 
mentation  of  APRIL.  This  result  is  similar  to  that  re¬ 
ported  by  Weber  and  Gupta  [26]  for  coarse-grain  mul¬ 
tithreaded  processors.  The  main  reason  a  low  degree  of 
multithreading  is  sufficient  is  that  context  switches  are 
forced  only  on  cache  misses,  which  are  expected  to  happen
infrequently.  The  marginal  benefit  of  additional
processes  is  seen  to  decrease  due  to  network  and  cache
interference. 

Why  is  utilization  limited  to  a  maximum  of  about 
0.80  despite  an  ample  supply  of  threads?  The  reason  is 
that  available  network  bandwidth  limits  the  maximum 
rate  at  which  computation  can  proceed.  When  available
network  bandwidth  is  used  up,  adding  more  processes
will  not  improve  processor  utilization.  On  the
contrary,  more  processes  will  degrade  performance  due
to  increased  cache  interference.  In  such  a  situation, 
for  better  system  performance,  effort  is  best  spent  in 
increasing  the  network  bandwidth,  or  in  reducing  the 
bandwidth  requirement  of  each  thread. 

The  relatively  large  ten-cycle  context  switch  overhead 
does  not  significantly  impact  performance  for  the  de¬ 
fault  set  of  parameters  because  utilization  depends  on 
the  product  of  context  switching  frequency  and  switch¬ 
ing  overhead,  and  the  switching  frequency  is  expected 
to  be  small  in  a  cache-based  system.  This  observation 
is  important  because  it  allows  a  simpler  processor  im¬ 
plementation,  and  is  exploited  in  the  design  of  APRIL. 

A  multithreaded  processor  requires  larger  caches  to 
sustain  the  working  sets  of  multiple  processes,  although 
cache  interference  is  mitigated  if  the  processes  share 
code  and  data.  For  the  default  parameter  set,  we  found 
that  caches  greater  than  64  Kbytes  comfortably  sus¬ 
tain  the  working  sets  of  four  processes.  Smaller  caches 
suffer  more  interference  and  reduce  the  benefits  of  mul¬ 
tithreading. 

9  Conclusions 

We  described  the  architecture  of  APRIL  -  a  coarse-grain
multithreaded  processor  to  be  used  in  a  cache-coherent
multiprocessor  called  ALEWIFE.  By  rapidly
switching  to  an  alternate  task,  APRIL  can  hide  communication
and  synchronization  delays  and  achieve  high
processor  utilization.  The  processor  makes  effective  use
of  available  network  bandwidth  because  it  is  rarely  idle.
APRIL  provides  support  for  fine-grain  tasking  and  detection
of  futures.  It  achieves  high  single-thread  performance
by  executing  instructions  from  a  given  task  until
an  exception  condition  like  a  synchronization  fault  or  remote
memory  operation  occurs.  Coherent  caches  reduce
the  context  switch  rate  to  approximately  once  every  50-100
cycles.  Therefore  context  switch  overheads  in  the
4-10  cycle  range  are  tolerable,  significantly  simplifying
processor  design.  By  providing  hardware  support  only
for  performance-critical  operations  and  migrating  other
functionality  into  the  compiler  and  run-time  system,  we
were  able  to  simplify  the  processor  design  even  further.

We  described  a  SPARC-based  implementation  of
APRIL  that  uses  the  register  windows  of  SPARC  as
task  frames  for  multiple  threads.  A  processor  simulator
and  an  APRIL  compiler  and  run-time  system  have  been
written.  The  SPARC-based  implementation  of  APRIL
switches  contexts  in  11  cycles.  APRIL  and  its  associated
run-time  system  practically  eliminate  the  overhead
of  fine-grain  task  creation  and  detection  of  futures.
For  Mul-T,  the  overhead  reduces  from  100%
on  an  Encore  Multimax-based  implementation  to  under
5%  on  APRIL.  We  evaluated  the  scalability  of  multithreaded
processors  in  large-scale  parallel  machines  using
an  analytical  model.  For  typical  system  parameters
and  a  10  cycle  context-switch  overhead,  the  processor
can  achieve  close  to  80%  utilization  with  3  processor
resident  threads.